Auto-detect audio format in OpenAISpeechToTextClient #7575
…#7543)

When the audio stream is not a FileStream, the client now peeks at the leading bytes to detect the format (wav, webm, m4a, mp3) and sets the multipart filename accordingly. This fixes HTTP 400 errors when sending non-MP3 audio (e.g. WAV) in a MemoryStream, since the OpenAI API uses the file extension to determine the audio format.

- Add DetectAudioExtension using Span.SequenceEqual for readability
- Add integration tests for all OpenAI-supported formats (mp3, wav, m4a, webm)
- Add unit tests covering each magic-byte detection branch
- Add ExpectedAudioFilename assertion to VerbatimMultiPartHttpHandler

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR updates OpenAISpeechToTextClient to auto-detect audio format (wav/webm/m4a/mp3) from leading “magic bytes” when the provided audio stream is not a FileStream, and uses the detected extension in the multipart filename so OpenAI can correctly infer the format (fixing 400s for non-MP3 MemoryStream inputs).
Changes:
- Add stream-header “magic byte” detection and filename resolution logic in OpenAISpeechToTextClient.
- Add unit tests validating filename selection for each supported format and branch.
- Add integration coverage for multiple embedded audio formats and enhance multipart handler assertions to validate the uploaded filename.
Reviewed changes
Copilot reviewed 5 out of 9 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/Libraries/Microsoft.Extensions.AI.OpenAI/OpenAISpeechToTextClient.cs | Adds filename resolution with magic-byte detection for non-FileStream inputs. |
| test/Libraries/Microsoft.Extensions.AI.OpenAI.Tests/OpenAISpeechToTextClientTests.cs | Adds theory-based unit tests asserting detected multipart filenames for different headers. |
| test/Libraries/Microsoft.Extensions.AI.Integration.Tests/VerbatimMultiPartHttpHandler.cs | Adds optional filename assertion for multipart “file” fields. |
| test/Libraries/Microsoft.Extensions.AI.Integration.Tests/SpeechToTextClientIntegrationTests.cs | Adds integration test that exercises auto-detection across multiple audio formats. |
| test/Libraries/Microsoft.Extensions.AI.Integration.Tests/Microsoft.Extensions.AI.Integration.Tests.csproj | Embeds additional audio resource files used by the new integration test. |
Comments suppressed due to low confidence (1)
src/Libraries/Microsoft.Extensions.AI.OpenAI/OpenAISpeechToTextClient.cs:121
- In GetStreamingTextAsync, ResolveFilename(audioSpeechStream) is executed unconditionally even for translation requests, but the translation branch immediately delegates to GetTextAsync(...) (which resolves the filename again). With the new magic-byte peek, this results in redundant header reads/rewinds for translation streaming.
_ = Throw.IfNull(audioSpeechStream);
string filename = ResolveFilename(audioSpeechStream);
if (IsTranslationRequest(options))
{
foreach (var update in (await GetTextAsync(audioSpeechStream, options, cancellationToken).ConfigureAwait(false)).ToSpeechToTextResponseUpdates())
}

/// <summary>Detects the audio format extension from the leading bytes of the audio data.</summary>
private static string DetectAudioExtension(ReadOnlySpan<byte> header)
For reference, OpenAI supported formats are: mp3, mp4, mpeg, mpga, m4a, wav, and webm. And quotes from the specs related to the matching occurring in this method:
- WAV — RIFF at offset 0, WAVE at offset 8
Source: Microsoft Multimedia Programming Interface and Data Specifications 1.0 (August 1991), referenced from:
https://www.mmsp.ece.mcgill.ca/Documents/AudioFormats/WAVE/WAVE.html
| Field | Length | Contents |
|---|---|---|
| ckID | 4 | Chunk ID: "RIFF" |
| cksize | 4 | Chunk size: 4+n |
| WAVEID | 4 | WAVE ID: "WAVE" |
And later under Examples, the full structure shows bytes 0–3 = RIFF, bytes 4–7 = size, and the WAVEID field at bytes 8–11 is WAVE.
- MP3 / MPEG / MPGA — ID3 at offset 0, or frame sync 0xFF 0xE_
Source: http://www.mp3-tech.org/programmer/frame_header.html (authoritative MP3 technical reference, derived from ISO/IEC 11172-3)
Verified citation (exact text):
The first twelve bits (or first eleven bits in the case of the MPEG 2.5 extension) of a frame header are always set to 1 and are called "frame sync".
And the header table shows:
| Sign | Length (bits) | Position (bits) | Description |
|---|---|---|---|
| A | 11 | (31-21) | Frame sync (all bits must be set) |

Eleven set bits means the first byte is 0xFF and the top three bits of the second byte are set: header[0] == 0xFF && (header[1] & 0xE0) == 0xE0
For ID3v2 tags preceding MP3 data:
Source: https://id3.org/id3v2.3.0 — Section 3.1 "ID3v2 header"
"The first three bytes of the tag are always "ID3" to indicate that this is an ID3v2 tag"
- MP4 / M4A — ftyp at offset 4
Source: W3C Note "ISO BMFF Byte Stream Format" (referencing ISO/IEC 14496-12 "ISO Base Media File Format"):
https://www.w3.org/TR/mse-byte-stream-format-isobmff/
Verified citation (exact text):
An ISO BMFF initialization segment is defined in this specification as a single File Type Box (ftyp) followed by a single Movie Box (moov).
Per ISO 14496-12 box format: bytes 0–3 = box size (uint32 big-endian), bytes 4–7 = box type (FourCC). The first box MUST be ftyp.
- WebM — 0x1A 0x45 0xDF 0xA3 at offset 0
Source: RFC 8794 — "Extensible Binary Meta Language" (IETF Standards Track), Section 8.1 "EBML Header":
https://www.rfc-editor.org/rfc/rfc8794.txt
Verified citation (exact text from Section 8.1):
The EBML Header MUST contain a single Master Element with an Element Name of "EBML" and Element ID of "0x1A45DFA3" (see Section 11.2.1)
WebM is a profile of Matroska (RFC 9559), which is an EBML Document Type. Every WebM file begins with the EBML Header whose first element has ID 0x1A45DFA3.
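To make the four checks above concrete, here is a minimal sketch of magic-byte detection. It is not the code from this PR: the class name AudioHeaderSniffer, the null return for unknown input, and the order of the checks are assumptions for illustration, and the u8 literals assume C# 11 or later.

```csharp
#nullable enable
using System;

// Hypothetical helper, not the PR's implementation.
public static class AudioHeaderSniffer
{
    // Returns "wav", "webm", "m4a", or "mp3"; null when no signature matches
    // (assumption: the caller picks the fallback).
    public static string? DetectAudioExtension(ReadOnlySpan<byte> header)
    {
        // WAV: "RIFF" at offset 0 and "WAVE" at offset 8.
        if (header.Length >= 12 &&
            header.Slice(0, 4).SequenceEqual("RIFF"u8) &&
            header.Slice(8, 4).SequenceEqual("WAVE"u8))
        {
            return "wav";
        }

        // WebM: EBML header element ID 0x1A45DFA3 at offset 0.
        ReadOnlySpan<byte> ebml = new byte[] { 0x1A, 0x45, 0xDF, 0xA3 };
        if (header.Length >= 4 && header.Slice(0, 4).SequenceEqual(ebml))
        {
            return "webm";
        }

        // MP4 / M4A: box type "ftyp" at offset 4, after the 4-byte box size.
        if (header.Length >= 8 && header.Slice(4, 4).SequenceEqual("ftyp"u8))
        {
            return "m4a";
        }

        // MP3 preceded by an ID3v2 tag: "ID3" at offset 0.
        if (header.Length >= 3 && header.Slice(0, 3).SequenceEqual("ID3"u8))
        {
            return "mp3";
        }

        // Bare MPEG audio frame: 11 sync bits set (0xFF, then top 3 bits of the next byte).
        if (header.Length >= 2 && header[0] == 0xFF && (header[1] & 0xE0) == 0xE0)
        {
            return "mp3";
        }

        return null;
    }
}
```

Every branch checks the length first, so a short or empty header falls through to null instead of throwing. A caller could then map null to whatever default filename it prefers.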
🎉 Good job! The coverage increased 🎉
Full code coverage report: https://dev.azure.com/dnceng-public/public/_build/results?buildId=1467244&view=codecoverage-tab
int bytesRead = 0;
while (bytesRead < header.Length)
{
    int n = audioSpeechStream.Read(header, bytesRead, header.Length - bytesRead);
Are we sure the stream is positioned at the beginning to ensure we are reading the header?
int bytesRead = 0;
while (bytesRead < header.Length)
{
    int n = audioSpeechStream.Read(header, bytesRead, header.Length - bytesRead);
Can you use Stream.ReadExactly here, or does that prevent you from being able to reliably rewind after reading in an unsuccessful read-exactly scenario?
Maybe Stream.ReadAtLeast might be appropriate though, with throwOnEndOfStream set to false?
But then again, maybe just having this loop here is the cleanest option, as you're not working against what those convenience APIs are trying to do.
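For comparison, a peek-and-rewind helper built around the manual loop might look like the sketch below. The names StreamPeek and PeekHeader are hypothetical, and the CanSeek guard and the restore-to-saved-position behavior are assumptions rather than what the PR does. It reads from wherever the stream is currently positioned, so it does not settle whether that position is the start of the audio data.

```csharp
using System;
using System.IO;

// Hypothetical helper, not the PR's implementation.
public static class StreamPeek
{
    // Reads up to buffer.Length bytes from the current position, then restores
    // that position. Returns the number of bytes actually read.
    public static int PeekHeader(Stream stream, byte[] buffer)
    {
        // Assumption: a non-seekable stream cannot be rewound, so it is not peeked.
        if (!stream.CanSeek)
        {
            return 0;
        }

        long start = stream.Position;
        int bytesRead = 0;
        try
        {
            // Read may return fewer bytes than requested, so loop until the
            // buffer is full or the stream ends (Read returns 0).
            while (bytesRead < buffer.Length)
            {
                int n = stream.Read(buffer, bytesRead, buffer.Length - bytesRead);
                if (n == 0)
                {
                    break;
                }

                bytesRead += n;
            }
        }
        finally
        {
            // Rewind even if Read throws, so the caller sees an unchanged stream.
            stream.Position = start;
        }

        return bytesRead;
    }
}
```

On .NET 7 and later, the loop body could be replaced by `stream.ReadAtLeast(buffer, buffer.Length, throwOnEndOfStream: false)`, which also returns the count of bytes read and stops early at end of stream. The explicit loop has the advantage of compiling on older target frameworks.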
}

audioSpeechStream.Position -= bytesRead;
return $"audio.{DetectAudioExtension(header.AsSpan(0, bytesRead))}";
What happens if we get an unrecognized format?
Fixes #7543